An Information Theoretic Approach to Bilingual Word Clustering
Abstract
We present an information-theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence and cross-lingual evidence from parallel corpora to learn high-quality word clusters jointly in any number of languages. The monolingual component of our objective is the average mutual information of clusters of adjacent words in each language, while the bilingual component is the average mutual information of the aligned clusters. To evaluate our method, we use the word clusters in a named entity recognition (NER) system and demonstrate a statistically significant improvement in F1 score when using bilingual word clusters instead of monolingual clusters.
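The two components described in the abstract can be illustrated with a small sketch. It computes the average mutual information (AMI) of cluster pairs, once over adjacent words within a language and once over word-aligned pairs across languages. The function names, the toy data layout, and in particular the weighted sum controlled by `beta` are assumptions made for illustration; the abstract does not state how the components are combined, and the sketch does not perform the clustering search itself.

```python
import math
from collections import Counter


def average_mutual_information(pairs):
    """Average mutual information (in nats) of a list of (x, y) cluster pairs."""
    pairs = list(pairs)
    total = len(pairs)
    if total == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    ami = 0.0
    for (x, y), n in joint.items():
        p_xy = n / total
        # p(x, y) / (p(x) * p(y)) simplifies to n * total / (count(x) * count(y))
        ami += p_xy * math.log(n * total / (left[x] * right[y]))
    return ami


def adjacent_cluster_pairs(sentences, clustering):
    """Cluster bigrams from adjacent words; `clustering` maps word -> cluster id."""
    return [
        (clustering[u], clustering[v])
        for sent in sentences
        for u, v in zip(sent, sent[1:])
    ]


def aligned_cluster_pairs(word_alignments, src_clustering, tgt_clustering):
    """Cluster pairs from (source_word, target_word) alignment links."""
    return [(src_clustering[s], tgt_clustering[t]) for s, t in word_alignments]


def joint_objective(src_sents, tgt_sents, word_alignments,
                    src_clustering, tgt_clustering, beta=0.5):
    """Hypothetical combination: monolingual AMI per language plus beta * bilingual AMI.

    The weighting scheme is an assumption, not taken from the abstract.
    """
    mono = (
        average_mutual_information(adjacent_cluster_pairs(src_sents, src_clustering))
        + average_mutual_information(adjacent_cluster_pairs(tgt_sents, tgt_clustering))
    )
    bi = average_mutual_information(
        aligned_cluster_pairs(word_alignments, src_clustering, tgt_clustering)
    )
    return mono + beta * bi
```

A clustering procedure would then search for the assignments `src_clustering` and `tgt_clustering` that maximize such an objective; that search is outside the scope of this sketch.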
Similar Resources
An information theoretic approach for using word cluster information in natural language call routing
In this paper, an information-theoretic approach for using word clusters in natural language call routing (NLCR) is proposed. This approach utilizes an automatic word class clustering algorithm to generate word classes from the word-based training corpus. In our approach, information gain (IG)-based term selection is used to combine both word term and word class information in NLCR. A joint...
Bilingual Clustering Using Monolingual Algorithms
The use of bilingual word classes greatly reduces the amount of data needed for training subsequential transducers, a finite state model adequate for small to medium translation tasks. We present an automatic approach to derive these classes using traditional monolingual word clustering methods.
Using Similarity Scoring To Improve the Bilingual Dictionary for Word Alignment
We describe an approach to improve the bilingual cooccurrence dictionary that is used for word alignment, and evaluate the improved dictionary using a version of the Competitive Linking algorithm. We demonstrate a problem faced by the Competitive Linking algorithm and present an approach to ameliorate it. In particular, we rebuild the bilingual dictionary by clustering similar words in a langua...
Using Similarity Scoring to Improve the Bilingual Dictionary for Sub-sentential Alignment
We describe an approach to improve the bilingual cooccurrence dictionary that is used for word alignment, and evaluate the improved dictionary using a version of the Competitive Linking algorithm. We demonstrate a problem faced by the Competitive Linking algorithm and present an approach to ameliorate it. In particular, we rebuild the bilingual dictionary by clustering similar words in a langua...
Automated Generalization of Translation Examples
Previous work has shown that adding generalization of the examples in the corpus of an example-based machine translation (EBMT) system can reduce the required amount of pretranslated example text by as much as an order of magnitude for Spanish-English and French-English EBMT. Using word clustering to automatically generalize the example corpus can provide the majority of this improvement for Fre...